Bellabeat is a small high-tech manufacturer of health-focused smart products, founded in 2013. Its devices collect data on activity, sleep, stress, and reproductive health to empower women with knowledge about their own health and habits.
This report analyzes smart device usage data to gain insight into how consumers use non-Bellabeat smart devices and to reveal opportunities for growth.
We will focus on Bellabeat’s app, which provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The app connects to Bellabeat’s line of smart wellness products.
The dataset used for this report is the FitBit Fitness Tracker Data, made available on Kaggle by Mobius.
The data were generated by respondents to a distributed survey via Amazon Mechanical Turk between 03.12.2016 - 05.12.2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring.
The dataset is licensed under CC0 (Public Domain): the person who associated the work with this deed has dedicated it to the public domain by waiving all of his or her rights to the work worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law. You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.
Understanding the Dataset
library(tidyverse)
library(readxl)
library(readr)
library(dplyr)
library(skimr)
library(janitor)
library(magrittr)
library(ggplot2)
library(lubridate)
library(ggpubr)
library(gapminder)
library(gganimate)
library(ggthemes)
library(transformr)
library(gifski)
We will first take a quick look at the number of rows, columns and number of unique participants.
#capture the paths of all the dataset files
files <- list.files(
  path = "D:\\User\\Courses\\Google Data Analytics coursera\\case study dataset",
  pattern = "\\.csv$",
  full.names = TRUE
)
#loop to store the number of rows, columns and unique IDs per file
numrows <- c()
numcols <- c()
numunique <- c()
for (i in seq_along(files)) {
  dataset <- read.csv(files[i])
  numrows <- append(numrows, nrow(dataset))
  numcols <- append(numcols, ncol(dataset))
  numunique <- append(numunique, n_unique(dataset$Id))
}
#store the dataset names without the path
datasetlist <- basename(files)
#dataset details
excelrowscols <- data.frame(dataset = datasetlist, rows = numrows,
cols = numcols, numofparticipants = numunique)
excelrowscols[order(excelrowscols$rows),]
## dataset rows cols numofparticipants
## 18 sleepDay_merged.csv 413 5 24
## 2 dailyActivity_merged.csv 940 15 33
## 3 dailyCalories_merged.csv 940 3 33
## 4 dailyIntensities_merged.csv 940 10 33
## 5 dailySteps_merged.csv 940 3 33
## 11 minuteCaloriesWide_merged.csv 21645 62 33
## 13 minuteIntensitiesWide_merged.csv 21645 62 33
## 17 minuteStepsWide_merged.csv 21645 62 33
## 7 hourlyCalories_merged.csv 22099 3 33
## 8 hourlyIntensities_merged.csv 22099 4 33
## 9 hourlySteps_merged.csv 22099 3 33
## 15 minuteSleep_merged.csv 188521 4 24
## 1 combined dataset merged.csv 1048575 3 7
## 10 minuteCaloriesNarrow_merged.csv 1325580 3 33
## 12 minuteIntensitiesNarrow_merged.csv 1325580 3 33
## 14 minuteMETsNarrow_merged.csv 1325580 3 33
## 16 minuteStepsNarrow_merged.csv 1325580 3 33
## 6 heartrate_seconds_merged.csv 2483658 3 14
Highlights about the dataset:
Datasets are mainly broken down into daily, hourly, minute and second-level records, with the minute-level data provided in both narrow and wide formats.
Statistical inference of the dataset
These datasets are non-probability samples, as they were generated by respondents to a distributed survey via Amazon Mechanical Turk. In addition, they have small sample sizes of between 8 and 33 participants, varying across datasets. We therefore highlight the risk of sampling bias and that the sampled units may not be representative of the larger target population of interest.
From the table above, we see that some datasets have more than 1,048,576 rows, which exceeds the maximum number of rows an Excel worksheet can load. Hence we will be using RStudio to process, clean and share our data.
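As a quick illustration, using row counts taken from the summary table above (three representative files), we can flag which files cannot be fully loaded into a single Excel sheet:

```r
# Excel's per-worksheet row limit
excel_limit <- 1048576

# Row counts from the summary table above (representative files)
rows <- c(
  dailyActivity_merged = 940,
  minuteStepsNarrow_merged = 1325580,
  heartrate_seconds_merged = 2483658
)

# Files that exceed the Excel limit
names(rows)[rows > excel_limit]
```

Any file flagged here would be silently truncated if opened in Excel, which is why we stay in R for the full pipeline.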
We will be using the following datasets for our analysis: dailyActivity_merged.csv, sleepDay_merged.csv and minuteMETsNarrow_merged.csv.
We chose not to use the separate calories/intensity/steps datasets, as they are subsets of dailyActivity_merged.csv. weightLogInfo_merged.csv and minuteSleep_merged.csv are not suitable for analysis due to their small sample sizes, inconsistent data and incomplete metadata.
Deeper look at our dataset
Now that we understand more about our data structures, we will process the datasets to look for any errors and inconsistencies.
#read files
daily_activities <- read_csv('D:\\User\\Courses\\Google Data Analytics coursera\\case study dataset\\dailyActivity_merged.csv')
sleep_log <- read_csv('D:\\User\\Courses\\Google Data Analytics coursera\\case study dataset\\sleepDay_merged.csv')
minmets <- read_csv('D:\\User\\Courses\\Google Data Analytics coursera\\case study dataset\\minuteMETsNarrow_merged.csv')
#check column types
str(daily_activities)
## spec_tbl_df [940 x 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Id : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityDate : chr [1:940] "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ TotalSteps : num [1:940] 13162 10735 10460 9762 12669 ...
## $ TotalDistance : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
## $ TrackerDistance : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
## $ LoggedActivitiesDistance: num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveDistance : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...
## $ ModeratelyActiveDistance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...
## $ LightActiveDistance : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...
## $ SedentaryActiveDistance : num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveMinutes : num [1:940] 25 21 30 29 36 38 42 50 28 19 ...
## $ FairlyActiveMinutes : num [1:940] 13 19 11 34 10 20 16 31 12 8 ...
## $ LightlyActiveMinutes : num [1:940] 328 217 181 209 221 164 233 264 205 211 ...
## $ SedentaryMinutes : num [1:940] 728 776 1218 726 773 ...
## $ Calories : num [1:940] 1985 1797 1776 1745 1863 ...
## - attr(*, "spec")=
## .. cols(
## .. Id = col_double(),
## .. ActivityDate = col_character(),
## .. TotalSteps = col_double(),
## .. TotalDistance = col_double(),
## .. TrackerDistance = col_double(),
## .. LoggedActivitiesDistance = col_double(),
## .. VeryActiveDistance = col_double(),
## .. ModeratelyActiveDistance = col_double(),
## .. LightActiveDistance = col_double(),
## .. SedentaryActiveDistance = col_double(),
## .. VeryActiveMinutes = col_double(),
## .. FairlyActiveMinutes = col_double(),
## .. LightlyActiveMinutes = col_double(),
## .. SedentaryMinutes = col_double(),
## .. Calories = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
str(sleep_log)
## spec_tbl_df [413 x 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Id : num [1:413] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ SleepDay : chr [1:413] "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
## $ TotalSleepRecords : num [1:413] 1 2 1 2 1 1 1 1 1 1 ...
## $ TotalMinutesAsleep: num [1:413] 327 384 412 340 700 304 360 325 361 430 ...
## $ TotalTimeInBed : num [1:413] 346 407 442 367 712 320 377 364 384 449 ...
## - attr(*, "spec")=
## .. cols(
## .. Id = col_double(),
## .. SleepDay = col_character(),
## .. TotalSleepRecords = col_double(),
## .. TotalMinutesAsleep = col_double(),
## .. TotalTimeInBed = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
str(minmets)
## spec_tbl_df [1,325,580 x 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Id : num [1:1325580] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityMinute: chr [1:1325580] "4/12/2016 12:00:00 AM" "4/12/2016 12:01:00 AM" "4/12/2016 12:02:00 AM" "4/12/2016 12:03:00 AM" ...
## $ METs : num [1:1325580] 10 10 10 10 10 12 12 12 12 12 ...
## - attr(*, "spec")=
## .. cols(
## .. Id = col_double(),
## .. ActivityMinute = col_character(),
## .. METs = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
#check for duplicates
sum(duplicated(daily_activities))
## [1] 0
sum(duplicated(sleep_log))
## [1] 3
sum(duplicated(minmets))
## [1] 0
From the checks above, we have noticed that the date columns are stored as character type and that sleep_log contains 3 duplicate rows.
We will therefore correct the date types and remove the duplicates.
#remove duplicates and NA rows
daily_activities <- daily_activities %>% distinct() %>% drop_na()
sleep_log <- sleep_log %>% distinct() %>% drop_na()
minmets <- minmets %>% distinct() %>% drop_na()
#check
sum(duplicated(daily_activities))
## [1] 0
sum(duplicated(sleep_log))
## [1] 0
sum(duplicated(minmets))
## [1] 0
#correct the date type and rename column names for merging
daily_activities <- daily_activities %>%
rename(date = ActivityDate) %>%
mutate(date = as.Date(date, format = "%m/%d/%Y"))
#format as date without time, since every timestamp in sleep_log is 12:00 AM
sleep_log <- sleep_log %>%
rename(date = SleepDay) %>%
mutate(date = as.Date(date, format = "%m/%d/%Y"))
minmets <- minmets %>%
mutate(ActivityMinute = as.POSIXct(ActivityMinute, format = "%m/%d/%Y %I:%M:%S %p"))
#check
colnames(daily_activities)
## [1] "Id" "date"
## [3] "TotalSteps" "TotalDistance"
## [5] "TrackerDistance" "LoggedActivitiesDistance"
## [7] "VeryActiveDistance" "ModeratelyActiveDistance"
## [9] "LightActiveDistance" "SedentaryActiveDistance"
## [11] "VeryActiveMinutes" "FairlyActiveMinutes"
## [13] "LightlyActiveMinutes" "SedentaryMinutes"
## [15] "Calories"
class(daily_activities$date)
## [1] "Date"
colnames(sleep_log)
## [1] "Id" "date" "TotalSleepRecords"
## [4] "TotalMinutesAsleep" "TotalTimeInBed"
class(sleep_log$date)
## [1] "Date"
colnames(minmets)
## [1] "Id" "ActivityMinute" "METs"
class(minmets$ActivityMinute)
## [1] "POSIXct" "POSIXt"
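The format strings above follow base R’s strptime conventions; in particular, %I with %p handles the 12-hour clock used in these files, so "12:00:00 AM" parses to midnight. A small self-contained check (assuming an English locale for the AM/PM marker):

```r
# One timestamp in the FitBit export format
x <- "4/12/2016 12:00:00 AM"

# %I = 12-hour clock, %p = AM/PM marker (locale-dependent)
parsed <- as.POSIXct(x, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

format(parsed, "%Y-%m-%d %H:%M:%S")  # "2016-04-12 00:00:00"
```

Using %H instead of %I here would fail to parse "12:00:00 AM", which is an easy mistake to make with this export format.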
With the dates corrected and duplicates removed, we proceed to merge the daily_activities and sleep_log datasets for analysis.
#merge
daily_activities_sleep <- merge(daily_activities, sleep_log)
#check
str(daily_activities_sleep)
## 'data.frame': 410 obs. of 18 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ date : Date, format: "2016-04-12" "2016-04-13" ...
## $ TotalSteps : num 13162 10735 9762 12669 9705 ...
## $ TotalDistance : num 8.5 6.97 6.28 8.16 6.48 ...
## $ TrackerDistance : num 8.5 6.97 6.28 8.16 6.48 ...
## $ LoggedActivitiesDistance: num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveDistance : num 1.88 1.57 2.14 2.71 3.19 ...
## $ ModeratelyActiveDistance: num 0.55 0.69 1.26 0.41 0.78 ...
## $ LightActiveDistance : num 6.06 4.71 2.83 5.04 2.51 ...
## $ SedentaryActiveDistance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveMinutes : num 25 21 29 36 38 50 28 19 41 39 ...
## $ FairlyActiveMinutes : num 13 19 34 10 20 31 12 8 21 5 ...
## $ LightlyActiveMinutes : num 328 217 209 221 164 264 205 211 262 238 ...
## $ SedentaryMinutes : num 728 776 726 773 539 775 818 838 732 709 ...
## $ Calories : num 1985 1797 1745 1863 1728 ...
## $ TotalSleepRecords : num 1 2 1 2 1 1 1 1 1 1 ...
## $ TotalMinutesAsleep : num 327 384 412 340 700 304 360 325 361 430 ...
## $ TotalTimeInBed : num 346 407 442 367 712 320 377 364 384 449 ...
As the sleep_log dataset only has 24 participants, the merged daily_activities and sleep_log dataset can contain at most 24 of the 33 participants.
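To illustrate why, here is a toy example with hypothetical IDs (not from the FitBit data): by default, merge() performs an inner join on the shared columns, so only participants present in both tables survive the merge.

```r
# Hypothetical activity data for three participants
activity <- data.frame(Id = c(1, 2, 3),
                       date = as.Date("2016-04-12"),
                       TotalSteps = c(13162, 10735, 9762))

# Hypothetical sleep data for only two of them
sleep <- data.frame(Id = c(1, 2),
                    date = as.Date("2016-04-12"),
                    TotalMinutesAsleep = c(327, 384))

# Inner join on the common columns (Id, date) by default
merged <- merge(activity, sleep)

length(unique(merged$Id))  # 2 - participant 3 drops out
```

The same mechanism caps our merged dataset at the 24 participants present in sleep_log.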
These will be the datasets for our analysis: daily_activities_sleep (the merged daily activity and sleep data) and minmets (minute-level METs).
During our analysis of the two datasets, we noted that some users recorded fewer days of data than the rest, as shown below.
idlist <- unique(minmets$Id)
xminmets <- minmets %>%
  mutate(date = date(ActivityMinute))
for (i in seq_along(idlist)) {
  dates <- xminmets %>%
    filter(Id == idlist[i]) %>%
    distinct(date)
  print(paste('No.', i, 'ID', idlist[i], ', days of data recorded:', nrow(dates)))
}
## [1] "No. 1 ID 1503960366 , days of data recorded: 30"
## [1] "No. 2 ID 1624580081 , days of data recorded: 31"
## [1] "No. 3 ID 1644430081 , days of data recorded: 30"
## [1] "No. 4 ID 1844505072 , days of data recorded: 31"
## [1] "No. 5 ID 1927972279 , days of data recorded: 31"
## [1] "No. 6 ID 2022484408 , days of data recorded: 31"
## [1] "No. 7 ID 2026352035 , days of data recorded: 31"
## [1] "No. 8 ID 2320127002 , days of data recorded: 31"
## [1] "No. 9 ID 2347167796 , days of data recorded: 18"
## [1] "No. 10 ID 2873212765 , days of data recorded: 31"
## [1] "No. 11 ID 3372868164 , days of data recorded: 20"
## [1] "No. 12 ID 3977333714 , days of data recorded: 29"
## [1] "No. 13 ID 4020332650 , days of data recorded: 31"
## [1] "No. 14 ID 4057192912 , days of data recorded: 4"
## [1] "No. 15 ID 4319703577 , days of data recorded: 31"
## [1] "No. 16 ID 4388161847 , days of data recorded: 31"
## [1] "No. 17 ID 4445114986 , days of data recorded: 31"
## [1] "No. 18 ID 4558609924 , days of data recorded: 31"
## [1] "No. 19 ID 4702921684 , days of data recorded: 31"
## [1] "No. 20 ID 5553957443 , days of data recorded: 31"
## [1] "No. 21 ID 5577150313 , days of data recorded: 30"
## [1] "No. 22 ID 6117666160 , days of data recorded: 28"
## [1] "No. 23 ID 6290855005 , days of data recorded: 28"
## [1] "No. 24 ID 6775888955 , days of data recorded: 26"
## [1] "No. 25 ID 6962181067 , days of data recorded: 31"
## [1] "No. 26 ID 7007744171 , days of data recorded: 26"
## [1] "No. 27 ID 7086361926 , days of data recorded: 31"
## [1] "No. 28 ID 8053475328 , days of data recorded: 31"
## [1] "No. 29 ID 8253242879 , days of data recorded: 18"
## [1] "No. 30 ID 8378563200 , days of data recorded: 31"
## [1] "No. 31 ID 8583815059 , days of data recorded: 30"
## [1] "No. 32 ID 8792009665 , days of data recorded: 28"
## [1] "No. 33 ID 8877689391 , days of data recorded: 31"
idlist2 <- unique(daily_activities_sleep$Id)
for (i in seq_along(idlist2)) {
  dates <- daily_activities_sleep %>%
    filter(Id == idlist2[i]) %>%
    distinct(date)
  print(paste('No.', i, 'ID', idlist2[i], ', days of data recorded:', nrow(dates)))
}
## [1] "No. 1 ID 1503960366 , days of data recorded: 25"
## [1] "No. 2 ID 1644430081 , days of data recorded: 4"
## [1] "No. 3 ID 1844505072 , days of data recorded: 3"
## [1] "No. 4 ID 1927972279 , days of data recorded: 5"
## [1] "No. 5 ID 2026352035 , days of data recorded: 28"
## [1] "No. 6 ID 2320127002 , days of data recorded: 1"
## [1] "No. 7 ID 2347167796 , days of data recorded: 15"
## [1] "No. 8 ID 3977333714 , days of data recorded: 28"
## [1] "No. 9 ID 4020332650 , days of data recorded: 8"
## [1] "No. 10 ID 4319703577 , days of data recorded: 26"
## [1] "No. 11 ID 4388161847 , days of data recorded: 23"
## [1] "No. 12 ID 4445114986 , days of data recorded: 28"
## [1] "No. 13 ID 4558609924 , days of data recorded: 5"
## [1] "No. 14 ID 4702921684 , days of data recorded: 27"
## [1] "No. 15 ID 5553957443 , days of data recorded: 31"
## [1] "No. 16 ID 5577150313 , days of data recorded: 26"
## [1] "No. 17 ID 6117666160 , days of data recorded: 18"
## [1] "No. 18 ID 6775888955 , days of data recorded: 3"
## [1] "No. 19 ID 6962181067 , days of data recorded: 31"
## [1] "No. 20 ID 7007744171 , days of data recorded: 2"
## [1] "No. 21 ID 7086361926 , days of data recorded: 24"
## [1] "No. 22 ID 8053475328 , days of data recorded: 3"
## [1] "No. 23 ID 8378563200 , days of data recorded: 31"
## [1] "No. 24 ID 8792009665 , days of data recorded: 15"
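The per-participant day counts above can also be computed without an explicit loop. A base R sketch on toy data (hypothetical IDs, not from the FitBit files):

```r
# Toy log: participant 1 has two distinct recorded days, participant 2 only one
daylog <- data.frame(
  Id = c(1, 1, 1, 2),
  date = as.Date(c("2016-04-12", "2016-04-12", "2016-04-13", "2016-04-12"))
)

# Count distinct recorded days per Id in one pass
days_recorded <- tapply(daylog$date, daylog$Id, function(d) length(unique(d)))

days_recorded  # 1 -> 2 days, 2 -> 1 day
```

This vectorised form scales better than the per-Id filter loop and returns a named vector that is easy to inspect or sort.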
Given these findings, and in addition to the risks highlighted above of sampling bias and of sampled units not being representative of the larger target population of interest, we recommend that Bellabeat either use its own data or conduct its own survey for a more accurate analysis.
Nevertheless, we can still draw some insights about the users from the analysis performed. We recommend adding the following features to Bellabeat’s app:
Daily reminders on the user’s device to hit at least 30 minutes of moderate-intensity activity per day, in line with WHO’s recommendation, plus an hourly prompt to stand up and move around for 3 minutes to reduce sedentary behaviour.
A reminder on the user’s device 30 minutes before their pre-set bedtime, asking them to get ready for sleep. The reminder would also provide a list of dos and don’ts to facilitate sleep, e.g. stop using electronic devices, as the blue light emitted affects the body’s ability to prepare for sleep.